This is a 6-hour course on Spark, available as one YouTube video. Many things in this video are still relevant in 2022.
- Spark and Mesos were started in AMPLab in UC Berkeley
- Spark was originally a sample app for Mesos, to "spark" an ecosystem of apps for Mesos.
- Spark is an engine for scheduling, distributing and monitoring applications that process Big Data
- Hive queries work out of the box in Spark SQL because it uses the same frontend and Metastore
- Spark Streaming has pretty much replaced Apache Storm. Spark in general has replaced most of the Hadoop ecosystem, which consisted of too many tools.
- All higher level things push the compute down to Spark Core
- Streaming, SQL, MLlib, GraphX, BlinkDB, Tachyon
- BlinkDB can do time-limited queries over large amounts of data. Also returns confidence level in the result.
- Tachyon is a memory-based distributed storage system
- Spark's Resource Managers
- YARN
- local-mode
- Mesos (from the Greek word for "middle" or "intermediary")
- standalone (e.g. as shipped with DataStax Cassandra)
- Resource managers can be made highly available by storing their configuration in ZooKeeper
- Storage: almost any filesystem or NoSQL database; anything that has a Hadoop InputFormat works.
- Fun fact: Tableau also uses Spark as its engine
- Hadoop MapReduce constantly writes intermediate results to HDFS; Spark keeps data in memory. Hence, Spark is usually at least 10x faster.
- Recommended reading
- Original Spark white paper and another white paper on RDDs
- "Learning Spark" from O'Reilly is the best book
- Spark packages can be found at https://spark-packages.org
- These packages provide integrations with other tools and libraries for specific tasks.
- Spark Shell always runs on the Driver
- In RDDs, the more partitions you have the more parallelism you have. Each partition requires a task, essentially a JVM thread.
- RDDs can hold any kind of data, even images and videos.
- The `parallelize` method on the `sparkContext` object can be used to create an RDD. Reading from a file also creates an RDD, but the partitions it creates might not be optimal; use `repartition` to fix that.
- A transformation like a `filter` could produce empty partitions. `coalesce` can be used to reduce the number of partitions in an RDD.
- A `.collect` call pulls the data back from the workers to the Driver JVM. Ensure that the Driver has enough memory for this; it's better to write the data out to HDFS or Cassandra, or to collect only a sample of a large RDD.
- All transformations in Spark are lazy. Spark builds a DAG of transformations but doesn't execute them. Only when an action is called does the DAG get executed, aka materialized.
- Caching RDDs
- An RDD has to be materialized before it can be cached.
- If not cached, the intermediate RDDs in a chain of transformations are discarded and must be recomputed. So, what to cache? Whichever RDD is used multiple times. Don't cache the base RDD; cache the cleaned RDD instead, for example.
- If an RDD doesn't fully fit into memory, Spark will store the rest on disk. Also applies to cached RDDs. Accessing disk would slow down Spark.
- Resource Managers
- Standalone, YARN and Mesos support dynamic allocation of executors; local mode has only static allocation.
- Parallelism: Hadoop MapReduce used separate JVM processes for parallelism, whereas Spark uses JVM threads inside its executor JVMs. Hadoop MR also had dedicated slots for Map and Reduce tasks, which was inefficient in resource utilization.
- Over-subscription: Set the number of tasks to be like 2-3x the number of cores on your machine. Setting them equal to the number of cores isn't necessary. Let the Executor JVM handle its threads.
- `spark-submit` is a command to submit your own Scala programs to the Spark Driver, which submits the work to the Spark Master.
- Standalone mode (with DataStax Cassandra)
- Master and Worker use little resources compared to the Executors. Masters can be HA and even be added to a live cluster.
- 1 Worker can only run 1 Executor for each application. To have more than one Executor on each machine for a single application, a corresponding number of Workers should be started as well.
- Mounted disks are listed in the `SPARK_LOCAL_DIRS` environment variable. RDDs can be stored (fully or partially) on these disks.
- It's possible to have a heterogeneous mix of machines. Spark learns the number of cores on each machine from the `SPARK_WORKER_CORES` variable in spark-env.sh. `SPARK_WORKER_MEMORY` is the amount of memory that a Spark Worker can give to its Executors. Similarly, a Master's memory is the total amount that it can allocate to its Workers.
- YARN: Yet Another Resource Negotiator
- Spark can run in either client mode or cluster mode on YARN.
- In client mode, the Driver runs on the client machine.
- In cluster mode, the application and the Driver are submitted to the cluster together, with no dependency on the client machine.
- On YARN, Spark can also dynamically increase and decrease the number of executors.
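A hypothetical `spark-submit` invocation tying the pieces together (the class name, master host, and sizes are made-up examples; `--master yarn` would target YARN instead of a standalone master):

```shell
# Submit a Scala app to a standalone master in client mode, with
# parallelism over-subscribed to ~3x the core count (e.g. 16 cores x 3).
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --executor-memory 4G \
  --conf spark.default.parallelism=48 \
  myapp.jar
```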
- Persistence
- Persisting in memory with serialization can be more space efficient, but costs CPU. Use a very efficient serialization format (e.g. Kryo).
- Caching of RDDs behaves like an LRU cache: if there isn't enough memory, the least recently used RDDs are evicted first.
- With memory + disk storage, the least recently used partitions of an RDD are moved to disk in serialized form. By default, the in-memory parts are not serialized.
- `MEMORY_ONLY_2` stores the RDD replicated on two different JVMs. Only to be used for extremely expensive-to-recompute RDDs.
- Tachyon (since renamed Alluxio) can be used for `OFF_HEAP` storage, and for sharing RDDs between different Spark applications written in multiple languages.
- Shuffle operations automatically persist their intermediate RDDs.
- Local Dirs will be cleaned up in an LRU fashion.
- PySpark objects are always serialized using the Pickle library. Python object representations are not used even when persisted in memory.
- Lineage of RDDs
- RDDs can be produced by narrow transformations or wide transformations.
- In narrow transformations, one partition in the parent is used by only one partition in the child (one-to-one or many-to-one). These can be executed in parallel.
- In wide transformations, one partition in the child can depend on multiple partitions in the parent. This needs shuffling and is costly to recreate.
- Pipelining is an internal optimization in Spark that lets one thread apply multiple narrow transformations to a partition.
- `toDebugString` prints the lineage of an RDD, with stage boundaries represented using indentation.
- Shuffling
- `repartition`, `join`, `cogroup`, and the `*By`/`*ByKey` operations all cause shuffles.
- Passing a `numPartitions` parameter to an operation might cause a shuffle.
- `coalesce` never causes a shuffle (unless its `shuffle` flag is explicitly set).
- Broadcast variables can avoid shuffles. A broadcast variable is a small amount of data, usually a read-only lookup table, that is broadcast from the Driver to all the Executors. Broadcast uses a BitTorrent-like protocol.
- Accumulators are like counters in Hadoop: they count events that occur during job execution. Good for debugging. They only support associative operations. Only the Driver program can read an accumulator's value; tasks cannot.
- PySpark is layered as (PySpark (Java API (Spark Core Engine))). It can use CPython or PyPy, for both Driver and Worker machines; PyPy is better when not using C libraries.
- Netty native buffers bypass the kernel and JVM buffers to move data directly from disk to the network, avoiding two copies.
- Shifted from Hash-based Shuffle to Sort-based Shuffle
- Spark Streaming can read data at high rates from various sources, do transformations and then write to various storage options.
- Processing latency must stay below the batch interval, otherwise the stream falls behind.
- Spark Streaming uses micro-batches, whereas Storm processes each event as it arrives. Storm Trident is similar to Spark Streaming. Storm uses at-least-once processing, so an event could be processed more than once. Spark Streaming exposes the stream of micro-batches as a DStream (a sequence of RDDs).
- Batch interval is user configurable.
- Can do sliding window operations on DStreams.